We consider the question of whether thermodynamic macrostates are objective consequences of
dynamics, or subjective reflections of our ignorance of a physical system. We argue that they are 
both; more specifically, that the set of macrostates forms the unique maximal partition of phase 
space which 1) is consistent with our observations (a subjective fact about our ability to observe 
the system) and 2) obeys a Markov process (an objective fact about the system's dynamics). We 
review the ideas of computational mechanics, an information-theoretic method for finding optimal
causal models of stochastic processes, and argue that macrostates coincide with the "causal states" 
of computational mechanics. Defining a set of macrostates thus consists of an inductive process 
where we start with a given set of observables, and then refine our partition of phase space until we 
reach a set of states which predict their own future, i.e. which are Markovian. Macrostates arrived 
at in this way are provably optimal statistical predictors of the future values of our observables. 

I. WHAT'S STRANGE ABOUT MACROSTATES, OR, IS IT JUST ME?

Almost from the start of statistical mechanics, there has been a tension between subjective or epistemic interpretations
of entropy, and objective or physical ones. Many writers, for instance the late E. T. Jaynes (1983), have
vigorously asserted that entropy is purely subjective, a quantification of one's lack of knowledge of the molecular state
of a system. It is hard to reconcile this story with the many physical processes which are driven by entropy increase,
or by competition between maximizing two different kinds of entropy (Fox, 1988). These processes either happen or
they don't, and observers, knowledgeable or otherwise, seem completely irrelevant. In a nutshell, the epistemic view
of entropy says that an ice-cube melts when I become sufficiently ignorant of it, which is absurd.

These difficulties with entropy are only starker versions of the difficulties afflicting all thermodynamic macroscopic 
variables. Their interpretation oscillates between a purely epistemic one (they are the variables which we happen to 
be willing and able to observe) and a purely physical one (they have their own dynamics and have brute physical 
consequences, e.g. for the amount of work which engines can do). These difficulties are inherited by our definition 
of macrostates. Standard references define macrostates either as sets of microstates, i.e. subsets of phase space,
with given values of a small number of macroscopic observables (Baierlein, 1999; Landau and Lifshitz, 1980; Reichl,
1980), or probability distributions over these (Balian, 1991; Ruelle, 1989). A given set of observables induces a set of
macrostates, which form a partition of the phase space;¹ but why is one such partition better than another?

There are generally several different sets of macroscopic variables which can be observed for a given system. In some
cases, different sets of observables are equivalent, in the sense that they induce the same partition of the phase space, 
and so their macrostates are in one-to-one correspondence; for instance, for an ideal gas with a constant number of 
molecules, we obtain the same macrostates by measuring either pressure and volume, or temperature and entropy. In 
other cases, observing different sets of variables will partition the set of microstates in different ways — producing 
partitions that are finer, coarser, or incomparable. 

Even if we restrict our attention to extensive variables, there is a hierarchy of increasingly disaggregated, fine-grained
levels of description, with associated macroscopic variables at each level. At the highest and coarsest level
are thermodynamic descriptions, in terms of system-wide extensive variables or bulk averages. Below them are
hydrodynamic descriptions, in terms of local densities of extensive quantities. Below them is the Boltzmannian level,
described with occupation numbers in cells of single-molecule phase space, or, in the limit, phase-space densities.
(Below the Boltzmannian level we get densities over the whole-system phase space, and so statistical mechanics
proper.) We can sometimes demonstrate, and in general believe, that we can obtain the coarser descriptions from the
finer ones by integration or "contraction" (cf. Keizer, 1987, ch. 9). Thus there are many hydrodynamic macrostates
for a given thermodynamic one, i.e. the hydrodynamic partition is much finer.

Clearly there is a problem here if macrostates are purely objective. In that case, we should be forced to use one 
level of description. On the other hand, we can formulate and test theories at all levels of description, and we know 
that, for instance, both thermodynamic and hydrodynamic theories are well-validated for many systems. 

We hope to offer a resolution along the following lines. Intelligent creatures (such as, to a small extent, ourselves) 
start with certain variables which they are able to observe, and which interest them. This collection of variables defines 
a partition of phase space. This partition may not be an optimal predictor of its own future; it may have non-Markovian 
dynamics, with unaccounted-for patterns in its variables' time series. In such cases, intelligent observers postulate 
additional variables, attempt to develop instruments capable of observing them, and thus refine this partition. Our 
proposal is that good macrostates are precisely the partitions at which this process terminates, i.e. refinements of the 
observational states whose dynamics are Markovian. One can show that there is a unique coarsest such refinement, 
given the initial set of observables, and that this refinement is provably optimal in several senses as a statistical 
predictor of the future. To prove these things we must give a brief summary of the theory of the causal architecture 
of stochastic processes, known as computational mechanics. 

Definitions of Markov processes and some basic information theory can be found in the Appendix. 

II. COMPUTATIONAL MECHANICS AND CAUSAL STATES 

In the late 1970s and early 1980s, workers in nonlinear dynamics developed methods, called "attractor reconstruction"
or "geometry from time series," to reconstruct the vector field of a dynamical system from time series of measurements
of some function of the state (Kantz and Schreiber, 1997; Packard, Crutchfield, Farmer and Shaw, 1980;
Takens, 1981). Inspired by this, since 1989 Crutchfield et al. (Crutchfield and Feldman, 1997; Crutchfield and Shalizi;
Crutchfield and Young, 1989; Feldman and Crutchfield, 1998; Shalizi and Crutchfield, 2001) have formulated a
theory, "computational mechanics," which constructs, from observations of a stochastic process, the minimal model
capable of generating that process. Put differently, they have developed a technique for discovering and representing
all the predictive patterns in a time series.

¹ Tolman (1938, p. 77n), while objecting to the name "macrostate," regards it as a partition, as above.

The minimal model produced by computational mechanics represents the causal architecture of the process, or 
alternately how it stores and processes information (hence the term computational mechanics). If statistical mechanics 
is a "forward" approach, deriving macro-consequences of micro-dynamics, computational mechanics is an "inverse" 
approach, finding minimal causal architectures capable of producing the statistics of observed time series. 

The key notion in computational mechanics is that of causal state, which works like this. We observe a stochastic 
process, which we break at arbitrary points into "histories" and "futures." Two histories belong to the same causal 
state if and only if they are equivalent for predicting the future, i.e., if they lead to the same conditional probability 
distribution for futures. 

Formally, consider a discrete-time stochastic process, stretching to infinity in both directions:
$\ldots, s_{t-1}, s_t, s_{t+1}, \ldots = \overleftrightarrow{s}$. Break the process into two parts: one, the "past" or "history," is all the values up to and including time $t$, which
we write $\overleftarrow{s}_t$; the other, the "future" $\overrightarrow{s}_t$, all the values after that time. We write $\overleftarrow{\mathbf{S}}$ and $\overrightarrow{\mathbf{S}}$ for the set of all possible
histories and futures respectively. If the system is in equilibrium and this time series is stationary, we can drop the
subscript $t$, and we assume this for now. We wish to predict $\overrightarrow{s}$ on the basis of $\overleftarrow{s}$.

Now, any prediction method treats some histories as equivalent to each other; for instance, if we model the system 
as only depending on its k previous values, two histories which last differed k time-steps ago are equivalent. Moreover, 
all we need to predict about the future is which equivalence class we will find ourselves in. Thus any prediction method 
induces a partition over the phase space, and for any partition there is some probability distribution over the future 
equivalence classes given which one we are in now. 

There are many ways to construct partitions about which we can make correct predictions. For instance, we could 
lump all of phase space into a single, trivial "macrostate," and announce that given we are in that state now, we 
will be in the future as well. But obviously such a scheme fails to capture anything important about the system. 
Computational mechanics seeks a partition which is as coarse as possible, but which captures all available information 
about the time series of observables. That is, it gives optimal predictions of our observables while attributing as little 
structure to the system as possible. 

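With this in mind, we claim that the optimal partition is simply the following. Say that two histories $\overleftarrow{s}$, $\overleftarrow{s}'$ are
causally equivalent if and only if they give the same conditional distribution for futures:

$$\overleftarrow{s} \sim_\epsilon \overleftarrow{s}' \quad \text{iff} \quad \Pr[\overrightarrow{s} \mid \overleftarrow{s}] = \Pr[\overrightarrow{s} \mid \overleftarrow{s}']$$

This relation $\sim_\epsilon$ is symmetric, reflexive and transitive, and thus divides the set $\overleftarrow{\mathbf{S}}$ of all pasts into equivalence classes.
We define the causal state of a history as its equivalence class under $\sim_\epsilon$:

$$\epsilon(\overleftarrow{s}) = \{ \overleftarrow{s}' \mid \overleftarrow{s}' \sim_\epsilon \overleftarrow{s} \}$$

Then it is clear that the dependence of the system on its past is completely captured by its causal state,

$$\Pr[s_{t+1} \mid \overleftarrow{s}_t] = \Pr[s_{t+1} \mid \epsilon(\overleftarrow{s}_t)]$$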
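
To make the equivalence relation concrete, here is a minimal sketch that groups finite histories into causal states. Everything in it is an illustrative assumption rather than part of the theory above: the toy binary process (a "0" is never followed by another "0"), the truncation to histories of length k, and the use of the next-symbol distribution as a stand-in for the full distribution over futures, which is legitimate here only because the toy process is first-order Markov.

```python
from collections import defaultdict
from itertools import product

def next_symbol_dist(history):
    """Pr[next symbol | history] for the hypothetical toy process."""
    if history[-1] == "0":
        return {"1": 1.0}            # a "0" is always followed by a "1"
    return {"0": 0.5, "1": 0.5}      # after a "1" the next symbol is a fair coin

def is_allowed(history):
    """Histories containing "00" have probability zero in the toy process."""
    return "00" not in history

def causal_states(k):
    """Group all allowed length-k histories by their predictive distribution."""
    classes = defaultdict(list)
    for symbols in product("01", repeat=k):
        history = "".join(symbols)
        if is_allowed(history):
            # Histories with identical conditional distributions share a key,
            # i.e. they are causally equivalent.
            key = tuple(sorted(next_symbol_dist(history).items()))
            classes[key].append(history)
    return sorted(sorted(members) for members in classes.values())

states = causal_states(3)
print(states)
```

In this sketch the five allowed histories collapse into two classes, according to whether the last symbol was "0" or "1"; nothing further back in the history changes the prediction.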

We now formalize our claim that the causal states form an optimal partition of histories. For a given partition $\mathcal{R}$ of
histories into equivalence classes, we consider the mutual information $I[\overrightarrow{s}; \mathcal{R}]$ between it and the system's future (for
definitions of information-theoretic quantities, see the Appendix). This quantity is limited by the mutual information
between the system's past and its future, $I[\overrightarrow{s}; \mathcal{R}] \le I[\overrightarrow{s}; \overleftarrow{s}]$. We call partitions which attain this limit prescient.
Shalizi and Crutchfield (2001) showed that the causal states are prescient. Moreover, if two histories are equivalent
with respect to some prescient partition, they are almost always in the same causal state as well. Thus, except for a
set of measure zero, any prescient partition is a refinement of the causal states, so the causal states form the coarsest
possible prescient partition.

Moreover, the causal states are the least complex prescient partition in the following sense. Given a partition into
equivalence classes $\mathcal{R}$ we define its statistical complexity as the entropy $H[\mathcal{R}]$. The statistical complexity is the amount
of information which the partition encodes about the past. Normally in statistical mechanics we seek to maximize the
entropy, but here entropy measures, not the unbiasedness of a distribution, but the complexity of a predictor, and
so in minimizing it we are applying Occam's Razor. We can therefore say that the optimal predictor is the prescient
predictor of minimal statistical complexity. In fact, for all prescient partitions $\mathcal{R}$ we have $H[\mathcal{R}] \ge H[\epsilon(\overleftarrow{s})]$, so the
causal states are the unique prescient states of minimal statistical complexity (Shalizi and Crutchfield, 2001; App.
D of Shalizi, 2001).
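
The two optimality properties can be checked by hand on a very small example. The sketch below is only an illustration under invented assumptions: three histories ("a", "b", "c") with made-up probabilities, two possible futures ("x", "y"), and conditional distributions chosen so that "a" and "b" are causally equivalent. It compares three partitions of the histories: the finest one, the causal-state partition, and the trivial one-cell partition.

```python
from math import log2

# Hypothetical toy numbers: probabilities of three histories, and the
# conditional distribution over two futures that each history induces.
P_HISTORY = {"a": 0.25, "b": 0.25, "c": 0.5}
P_FUTURE = {
    "a": {"x": 0.9, "y": 0.1},
    "b": {"x": 0.9, "y": 0.1},   # same as "a", so "a" and "b" are causally equivalent
    "c": {"x": 0.2, "y": 0.8},
}

def entropy(dist):
    """Shannon entropy in bits of a dict of probabilities."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def cell_prob(cell):
    return sum(P_HISTORY[h] for h in cell)

def future_given_cell(cell):
    """Distribution over futures, conditional on the history lying in `cell`."""
    total = cell_prob(cell)
    mix = {}
    for h in cell:
        for fut, p in P_FUTURE[h].items():
            mix[fut] = mix.get(fut, 0.0) + P_HISTORY[h] * p / total
    return mix

def info_about_future(partition):
    """I[future; R] = H[future] - H[future | R]."""
    marginal = future_given_cell(list(P_HISTORY))
    conditional = sum(cell_prob(c) * entropy(future_given_cell(c)) for c in partition)
    return entropy(marginal) - conditional

def statistical_complexity(partition):
    """H[R]: entropy of the distribution over the cells of the partition."""
    return entropy({i: cell_prob(c) for i, c in enumerate(partition)})

identity = [["a"], ["b"], ["c"]]      # every history is its own cell
causal = [["a", "b"], ["c"]]          # causal-state partition
trivial = [["a", "b", "c"]]           # one cell: predicts nothing

for name, part in [("identity", identity), ("causal", causal), ("trivial", trivial)]:
    print(name, info_about_future(part), statistical_complexity(part))
```

Under these assumptions the identity and causal partitions carry the same information about the future (both are prescient), the causal partition does so with lower entropy, and the trivial partition carries none.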

Because the causal states are optimal predictors, we can say that the statistical complexity of the process is equal
to their statistical complexity; we denote this $C_\mu = H[\epsilon(\overleftarrow{s})]$. The statistical complexity of the process is therefore
the amount of information about its past which is relevant to its future, the amount retained internally, so to speak.
It also has a nice physical interpretation connected to thermodynamic entropy which we give in the next section.

A common measure of the complexity of a stochastic process, and of the amount of information stored in it, is
the predictive information (Bialek, Nemenman and Tishby, 2001), effective measure complexity (Grassberger, 1986)
or "stored information" (Shaw, 1984), $E = I[\overrightarrow{s}; \overleftarrow{s}]$. Since the causal states are prescient, $I[\overrightarrow{s}; \overleftarrow{s}] = I[\overrightarrow{s}; \epsilon(\overleftarrow{s})] \le H[\epsilon(\overleftarrow{s})]$
(see the Appendix), so $C_\mu \ge E$. While a complete knowledge of the causal state reduces the uncertainty in
the future by $E$ bits, one needs $C_\mu$ bits to make this prediction. In general (Shalizi and Crutchfield, 2001), $E \le C_\mu$,
and for the specific case of first-order Markov processes, $C_\mu = E + h_\mu$, where $h_\mu = H[s_{t+1} \mid \overleftarrow{s}_t]$ is the entropy rate
of the process, also equal to $H[s_{t+1} \mid \epsilon(\overleftarrow{s}_t)]$. These notions will be useful later on, when we propose a definition of
emergence based on the ratio between predictive information and statistical complexity.
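
The first-order Markov relation $C_\mu = E + h_\mu$ is easy to verify numerically. The following sketch does so for a hypothetical two-symbol chain; the transition probabilities are invented for illustration, and the stationary distribution was solved by hand. It relies on two assumptions specific to this example: the rows of the transition matrix differ, so the causal states coincide with the current symbol, and the chain is first-order Markov, so the mutual information between past and future reduces to that between consecutive symbols.

```python
from math import log2

# Hypothetical chain: P[a][b] = Pr[next = b | current = a].
# After a "0" the next symbol is always "1"; after a "1" it is a fair coin.
P = {"0": {"1": 1.0}, "1": {"0": 0.5, "1": 0.5}}
# Stationary distribution of this chain, solved by hand from pi = pi P.
PI = {"0": 1 / 3, "1": 2 / 3}

def entropy(dist):
    """Shannon entropy in bits of a dict of probabilities."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Statistical complexity: entropy of the causal-state distribution.
C_mu = entropy(PI)

# Entropy rate: average uncertainty of the next symbol given the current state.
h_mu = sum(PI[a] * entropy(P[a]) for a in P)

# Predictive information, computed independently as I[current; next]
# from the joint distribution of two consecutive symbols.
joint = {(a, b): PI[a] * p for a in P for b, p in P[a].items()}
next_marginal = {}
for (a, b), p in joint.items():
    next_marginal[b] = next_marginal.get(b, 0.0) + p
E = entropy(PI) + entropy(next_marginal) - entropy(joint)

print(C_mu, E, h_mu)
```

For this chain the three quantities come out to roughly 0.918, 0.252 and 0.667 bits, so $E$ is strictly less than $C_\mu$ and the gap is exactly the entropy rate.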

Readers familiar with the literature on statistical explanation will recognize that the causal state partition
can also be thought of as an application of Salmon's notion of a "statistical relevance basis" to stochastic processes
(Salmon, 1971, 1984).

If we consider the time series of causal states $\ldots, \epsilon(\overleftarrow{s}_{t-1}), \epsilon(\overleftarrow{s}_t), \epsilon(\overleftarrow{s}_{t+1}), \ldots$, we have a new stochastic process.
Moreover, since each causal state $\epsilon(\overleftarrow{s}_t)$ contains all the relevant information about its entire past $\overleftarrow{s}_t$, this is a Markov
process, i.e. the probability distribution of futures depends only on the current state (see Appendix B). Thus we have
collapsed the original process, regardless of its dependence on its history, into a Markov process, but one which
contains all relevant information about the original process.

The observed process is a random function of this Markov process, i.e., a kind of "hidden Markov model". The
Markov properties of the causal states justify in part their name, since they are exactly the "screening-off" properties
that have long been recognized as essential to causation (Salmon, 1984), and which form the basis of statistical
methods of causal inference for non-dynamical systems (Pearl, 2000; Spirtes, Glymour and Scheines, 2001).


If the underlying process $\overleftrightarrow{S}$ is not stationary, all is not lost. Formally, in fact, we can generalize the theory to
arbitrary processes. For our purposes here, however, we only need the idea of a conditionally stationary process,
which is to say one in which $\Pr(\overrightarrow{s}_t \mid \overleftarrow{s}_t = \overleftarrow{s}) = \Pr(\overrightarrow{s}_0 \mid \overleftarrow{s}_0 = \overleftarrow{s})$, for all times $t$ and histories $\overleftarrow{s}$. The above theory
then carries over directly (stationary processes are all conditionally stationary as well), with the exception that the
probability distribution of the causal states, and $H[\mathcal{S}]$, can be a function of time.

We have spoken throughout as if we knew all the necessary conditional probabilities exactly. This is sometimes the
case with analytical models (Feldman and Crutchfield, 1998), but never with experimental data. However, there are
reconstruction algorithms which, under mild statistical assumptions, will converge to the correct causal states, given
sufficient experimental data (Shalizi et al., 2002). For our present purposes, it is enough to know that causal states
can be inferred reliably from observations.



III. CAUSAL STATES FROM COARSE-GRAINED OBSERVATIONS 

Consider our favorite statistical-mechanical system. It has a phase space $\Gamma$, every point $q$ of which is a complete
specification of the positions, momenta, spins, etc. of all particles. Discretizing time, as is common, we say that the
evolution on $\Gamma$ is governed by an operator $T$: $q_{t+1} = T q_t$. We do not rule out the possibility that $T$ is stochastic, but
we insist that there are no "hidden variables," so that $\{q_t\}$ form a Markov chain. We are concerned with an ensemble
of such systems, so we write the random variable for the current microstate $Q$. We do not assume that the ensemble
is any of the usual thermodynamic ensembles, or even that the distribution of $Q$ is invariant.

The system changes over time. We probe it with observations of limited precision at each time step. What our
probes give us are many-one functions of the location in phase space. More formally, the observation process is
represented by a function $f : \Gamma \mapsto S$, where $S$ is our favorite (possibly multi-dimensional) space for representing
observations.² This function $f$ partitions the phase space $\Gamma$, i.e., it divides it into mutually exclusive and jointly
exhaustive sets, on each of which $f$ takes a unique value. Let the partition of $\Gamma$ induced by $f$ be $\mathcal{F}$. Then $f(Q_t) = S_t$
defines another process, which need not be Markovian. Call it the observed process.

We now form the causal states of the observed process. Our observations, as noted, induce a partition of the
phase space. Therefore a sequence of observations induces a refinement of that partition. Each observation value
$x$ corresponds to a set $\mathcal{F}_x$ of points in phase space. The sequence of observations $x, y$ thus corresponds to the set
$\mathcal{F}_{x,y} = \mathcal{F}_y \cap T\mathcal{F}_x$, i.e., those points at which we observe $y$ now, and at which we would have observed $x$ one step
back. The sets $\mathcal{F}_{x,y}$ are a refinement of the observed partition $\mathcal{F}$, and we can extend this to countable sequences of
observations. The set of causal states, $\mathcal{S}$, is a partition on the set $\overleftarrow{\mathbf{S}}$ of observational histories. Therefore it induces a
partition on $\Gamma$ which is a coarsening of the partition induced by infinite-length histories. Call this partition $\mathcal{Q}$. Each
causal state therefore corresponds to a region of phase space, which in principle is accessible to some coarse-grained
observational procedure. Observations of this variable form a new stochastic process $\{\mathcal{S}_t\}$ which is Markovian, and a
knowledge of $\mathcal{S}_t$ is all that is needed to predict $\overrightarrow{s}_t$ optimally.

² Strictly, $f$ should map to a distribution of observed values, to represent fluctuations and noise, in which case macroscopic states would
be defined by probability distributions of macroscopic variables. This, however, would increase the complexity of our exposition without
a corresponding gain in insight, so we will pretend that our observations are exact and noiseless.
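
The way a pair of successive observations refines the observational partition can be seen on a toy system. The sketch below is purely illustrative: the six-point phase space, the cyclic-shift dynamics `T` and the two-valued observation `f` are hypothetical choices, not taken from any physical model. It computes the cells $\mathcal{F}_{x,y} = \mathcal{F}_y \cap T\mathcal{F}_x$ directly from their definition.

```python
GAMMA = range(6)  # toy phase space with six microstates

def T(q):
    """Hypothetical deterministic dynamics: a cyclic shift of the microstate."""
    return (q + 1) % 6

def f(q):
    """Hypothetical coarse observation: which half of phase space we are in."""
    return 0 if q < 3 else 1

def cell(x):
    """F_x: all microstates at which we observe the value x."""
    return {q for q in GAMMA if f(q) == x}

def image(points):
    """T applied to a set of microstates."""
    return {T(q) for q in points}

def refined_cell(x, y):
    """F_{x,y}: microstates where we see y now and would have seen x one step back."""
    return cell(y) & image(cell(x))

refinement = {(x, y): refined_cell(x, y) for x in (0, 1) for y in (0, 1)}
for pair, points in sorted(refinement.items()):
    print(pair, sorted(points))
```

Each two-step cell sits inside the one-step cell for its second observation, and together the four cells still cover the whole phase space, so the two-step partition is a refinement of the original one.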

Let us write the partition on $\overleftarrow{\mathbf{S}}$ induced by the present observation as $\mathcal{X}$. What is the relationship between the
causal partitions, $\mathcal{Q}$ and $\mathcal{S}$, and the corresponding observational partitions, $\mathcal{F}$ and $\mathcal{X}$? There are four possibilities:

1. The observational and the causal partitions are the same. 

2. The causal partition is a refinement of the observational one.

3. The causal partition is coarser than the observational one. 

4. The causal and observational partitions are incomparable. 

In the next sections we explain the physical meaning of cases (1-3), and give physical examples of them. 

A. The Observables Define a Macrostate 

Suppose two observational histories, $\ldots s_{t-2} s_{t-1} s_t$ and $\ldots s'_{t'-2} s'_{t'-1} s'_{t'}$, are causally equivalent iff $s_t = s'_{t'}$. This
means that the current causal state is defined by the current values of the macroscopic observables, and conversely
any difference in a macroscopic observable means a difference in causal state. In the notation introduced earlier,
$\mathcal{S} = \mathcal{X}$ iff $\mathcal{F} = \mathcal{Q}$. The macrostates then have all the properties of causal states. Their dynamics are Markovian
and statistically reproducible, and no prediction of future values of the macrovariables can be better than one based
simply on their present value. An obvious example is the combination of pressure, volume and temperature for an
ideal gas near equilibrium.

In such cases, we can give a nice interpretation to the statistical complexity $C_\mu$. Recall that $C_\mu = H[\mathcal{S}]$, the amount
of information needed to specify the causal state. Because the causal state $\mathcal{S}$ and the macrostate $S$ are equivalent,
$H[\mathcal{S}] = H[S]$. But $S = f(Q)$, so $H[S|Q] = 0$: if we knew the exact microstate, there would be no uncertainty
in the macrostate. Now, for any two random variables, $H[X,Y] = H[X] + H[Y|X]$. Let us make both possible
decompositions of $H[Q,S]$.

$$\begin{aligned}
H[Q|S] + H[S] &= H[S|Q] + H[Q] \\
H[Q|S] + C_\mu &= H[Q] \\
C_\mu &= H[Q] - H[Q|S] \\
&= I[Q;S]
\end{aligned}$$

That is, the statistical complexity is just the amount of information about the microstate that is contained in the
macrovariables.

Since the macrostates form a first-order Markov chain, there is, as we mentioned above, a simple relationship between
the statistical complexity, the entropy rate, and the predictive information, viz., $E = C_\mu - h_\mu$. Since $E = I[\overrightarrow{s}; \mathcal{S}]$,
and $h_\mu = H[S_{t+1}|S_t]$, we have $I[\overrightarrow{s}; S] = H[S] - H[S_{t+1}|S_t]$.

B. The Causal States Are Finer than the Macrostates

Suppose $\mathcal{Q}$ is a refinement of $\mathcal{F}$, or, equivalently, $\mathcal{S}$ is a refinement of $\mathcal{X}$. Then, in addition to knowing the current
values of the macrovariables, we must know something of their history as well. Or, more exactly, if we do not, we
do not have a causally complete set of macrovariables, and the observed dynamics are not only non-Markovian, they
cannot be optimally predicted. However, they can be optimally predicted from a knowledge of $\mathcal{S}$, whose time-evolution
is Markovian. Moreover, if we know what cell of $\mathcal{Q}$ the system is in, we know the value of $\mathcal{S}$, which suggests that, in
principle, there is an observational procedure which will tell us how to optimally predict our original macrovariables.

We can go one step further, however, by invoking a result about refinements of partitions (see the appendix).
Suppose $A$ is a partition and $B$ is a refinement of it. Then there exists at least one minimal factor partition $C$ such
that $B$ is the product of $A$ and $C$, and this is not true of any partition with fewer cells than $C$. Since $\mathcal{Q}$ is a refinement
of $\mathcal{F}$, there is therefore at least one $\mathcal{Z}$ such that $\mathcal{Q} = \mathcal{F} \cdot \mathcal{Z}$. If we observe the macrovariable $Z$ corresponding to $\mathcal{Z}$
together with our original macrovariables, it is the same as if we had observed the causal state directly, and so we get
a causally complete set of macrovariables and nice macrostates. In other words, if we observe either $\mathcal{S}$ or $(S, Z)$, we
reduce the present case to that in the previous subsection.

Generally, there are a very large number of minimal factor partitions which can take the role of Z. Which one we 
observe is dictated by practical considerations: experimental accessibility, smoothness of the resulting macrovariable
over phase space, degree of uncertainty in the macrovariable, etc. This should not be worrisome, however, since there 
are elementary cases where we can complete a set of macrovariables in more than one way. Given pressure and volume 
for an ideal gas, for instance, we get the same macrostates from observations of temperature or molecule number. 
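
A minimal sketch of the product-of-partitions idea may help here. It assumes a toy four-point phase space and represents partitions as sets of frozensets; the names `F`, `Q`, `Z1` and `Z2` are illustrative stand-ins for the observational partition, the causal partition and two candidate factor partitions, not quantities derived from any real system.

```python
def product(A, C):
    """Product partition: every non-empty intersection of a cell of A with a cell of C."""
    return {a & c for a in A for c in C if a & c}

# Hypothetical four-point phase space {0, 1, 2, 3}.
F = {frozenset({0, 1}), frozenset({2, 3})}     # observational partition
Q = {frozenset({q}) for q in range(4)}         # causal partition, a refinement of F

# Two different two-cell factor partitions.
Z1 = {frozenset({0, 2}), frozenset({1, 3})}
Z2 = {frozenset({0, 3}), frozenset({1, 2})}

# A one-cell partition cannot serve as a factor: its product with F is just F.
ONE_CELL = {frozenset(range(4))}

print(product(F, Z1) == Q, product(F, Z2) == Q, product(F, ONE_CELL) == F)
```

Both `Z1` and `Z2` complete `F` to the causal partition, and no partition with fewer cells does, so in this toy case the minimal factor partition is not unique; which one to measure is a matter of convenience.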

It is worth noting that the minimal factor partition $\mathcal{Z}$ is incomparable to and so in some sense orthogonal or
unpredictable from $\mathcal{F}$. Clearly, there is a bijection between $\mathcal{S}$ and $(S, Z)$. Hence $H[\mathcal{S}] = H[S,Z] = H[Z|S] + H[S]$.
Since $H[\mathcal{S}]$ does not depend on our choice of factor variable $Z$ (and neither does $H[S]$), it follows that $H[Z|S]$ is the same for all factor variables,
i.e., they all have the same degree of uncertainty remaining once we know the original observables. Furthermore, all
the information they contain is relevant to the causal state:

$$\begin{aligned}
I[\mathcal{S};Z] &= H[\mathcal{S}] - H[\mathcal{S}|Z] \\
&= H[S,Z] - H[S,Z|Z] \\
&= H[S,Z] - (H[S,Z,Z] - H[Z]) \\
&= H[S,Z] - H[S,Z] + H[Z] \\
&= H[Z]
\end{aligned}$$

Similarly, $I[\mathcal{S}; Z|S] = H[Z|S]$, which is independent of the factor partition we use.
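
These identities can be checked numerically. The sketch below assumes four causal states with invented probabilities, an original macrovariable `S` that distinguishes only two groups of them, and two alternative factor variables `Z1` and `Z2`, each of which together with `S` pins down the causal state. All names and numbers are hypothetical.

```python
from math import log2

# Hypothetical probabilities of four causal states, labelled 0..3.
P_CAUSAL = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}

def S(c):
    """Original macrovariable: distinguishes {0, 1} from {2, 3}."""
    return 0 if c in (0, 1) else 1

def Z1(c):
    """First candidate factor variable."""
    return c % 2

def Z2(c):
    """Second candidate factor variable."""
    return 0 if c in (0, 3) else 1

def entropy_of(fn):
    """Entropy in bits of the random variable fn(causal state)."""
    dist = {}
    for c, p in P_CAUSAL.items():
        dist[fn(c)] = dist.get(fn(c), 0.0) + p
    return -sum(p * log2(p) for p in dist.values() if p > 0)

H_causal = entropy_of(lambda c: c)

def check(Z):
    """Return (H[S,Z], H[Z|S], I[causal;Z]) for a factor variable Z."""
    H_SZ = entropy_of(lambda c: (S(c), Z(c)))
    H_Z_given_S = H_SZ - entropy_of(S)
    # I[causal; Z] = H[causal] + H[Z] - H[causal, Z]
    I_causal_Z = H_causal + entropy_of(Z) - entropy_of(lambda c: (c, Z(c)))
    return H_SZ, H_Z_given_S, I_causal_Z

results = {name: check(Z) for name, Z in [("Z1", Z1), ("Z2", Z2)]}
print(H_causal, results)
```

In this example the two factor variables have different entropies, yet the uncertainty remaining in each once `S` is known is the same, as the argument above requires.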

Sadly, there is no guarantee that any of the factor partitions are experimentally accessible, still less accessible by 
practical or easy experimental procedures. In such cases, however, we may still eliminate memory effects from our 
models by constructing the causal states from observational histories.³

For an example of this method in (unwitting) action, consider hysteresis in ferromagnets. The response of a
ferromagnetic substance to a magnetic field can be treated, equivalently, either as a function of its past history of
applied fields, or as a function of the current applied field and the magnetization. Another example is provided by the
study of chaotic dispersion in fluid jets (Cencini et al., 1999). The initial measurement partition here involves the
character of the motion of the jet, and shows strong memory effects, significantly complicating the analysis. Recent
work has shown how to eliminate these memory effects, by refining the partition of the state space in just the way we
suggest above (... et al., 2002; Lacorata et al., 2001).

1. An Apparent Counterexample: Disordered Materials 

Amorphous solids (Zallen, 1983) and their magnetic equivalents, spin glasses (Fischer and Hertz, 1988), are remarkable
not just because they display slow dynamics, but because they display distinct dynamics on an immense
range of time-scales. A crude but graphic illustration is given by ordinary silicate glass. Under mechanical stresses
with short characteristic times, it is brittle; under stresses with long characteristic times, it is effectively liquid. Spin
glasses, similarly, can display distinct susceptibilities to oscillatory magnetic fields over sixteen orders of magnitude
in frequency (Fischer and Hertz, 1988). Since the two cases are basically similar, but the requisite physical theory is
easier to grasp for spin glasses, we will concentrate on them.

This hierarchy of time-scales implies that memory effects are very important in disordered materials. Indeed, 
many of the usual assumptions made in discussions of statistical mechanics, such as having an "aged" ensemble at 
equilibrium, are simply nonsensical in these cases. At low temperatures, the slowest time-scales can be geologically 
significant. To work with samples which have aged into equilibrium requires literally inhuman longevity (to say
nothing of patience). While technologies which would allow this have been proposed (Dyson, 1979), they are not yet
common in laboratories. Have we here found substances where our approach to eliminating memory effects breaks 
down? And what does one do, if one cannot use properly aged and equilibrated ensembles? 

The physical mechanism responsible for the long time-scales actually holds the answer to both questions. Each
spin in a spin glass participates in a mixture of ferromagnetic and antiferromagnetic interactions of varying strengths.
The result is generally frustration, i.e., no setting of the spins minimizes all interaction energies simultaneously. This
leads to the existence of numerous local minima in the energy landscape, generally with widely varying energies, and
so widely varying heights of the barriers separating them. One must either flip many spins at once, or equivalently
make many energetically-unfavorable spin flips in succession, to get from one minimum to another. The time it takes
to pass between minima will generally be exponential in the height of the energy barrier between them, as one expects
from the Arrhenius equation. (The causes and details of frustration in glass are different, but the overall picture is
similar.) Thus, on a given time-scale, barriers above a certain height are effectively infinite, i.e., the probability of
crossing them is negligible. The spin glass is thus effectively confined to a fixed region of phase space. Within this
region, the local minima define metastable states, with characteristic life-spans, and so relative probabilities, that
reflect the heights of the barriers surrounding them.

³ For more on reducing dialectical or historical explanations to mechanical ones via causal states, see Shalizi (2001).

We can thus see the way to eliminating memory effects: one takes as one's macrovariables the occupancy probabilities 
of the metastable local minima. Those in the effectively-inaccessible region do not contribute. Within the accessible 
region of phase space, there is a more-or-less gradual leakage of probability from the initial metastable state to the 
others. To extrapolate this forward, however, we do not need to know the history of that seepage, merely the current 
distribution over the local minima. In fact, this is a common theoretical ploy, sometimes spoken of as employing "a
macroscopic number of macroscopic degrees of freedom" (Fischer and Hertz, 1988). Experimentally, one never studies
an equilibrium ensemble, but rather one that is always aging, and it is precisely the aging properties which are of 
interest! 



C. The Macrostates Are Finer than the Causal States 

Suppose X is a refinement of S. Then some distinct values of the macrovariables have exactly the same consequences 
for the future evolution of the macrovariables. The distinction between those macrostates is meaningless, and some 
of the details in those macrostates are superfluous. There are several reasons, by no means mutually exclusive, why
this might be so. 

First, some of our variables could be irrelevant, given the others. More precisely, future events could be statistically 
independent of the value of variable Y given the present value of other variables X. It is hard to find examples of 
this in statistical mechanics proper, simply because those variables have been subject to a long process of (informal) 
selection for relevance, but it is easy to find examples of this in other domains of scientific inquiry. Techniques for 
identifying, or constructing, combinations of variables which render others irrelevant play a major role in statistical
methods of causal inference (Pearl, 2000; Spirtes, Glymour and Scheines, 2001). (Note that if one macrovariable is a
deterministic function of the others, then we get the same partition of $\Gamma$ whether or not we adjoin it to the others.
Similarly, the partition of histories we get is the same.) 

Second, our observational procedure could encode an "unphysical" distinction. In nematic liquid crystals, for 
instance, an important role is played by the "director", a local vector indicating the average direction of orientation
of the rod-shaped molecules in the neighborhood of a point. However, the molecules in a nematic are symmetric when
their long axis is inverted, so the director is not a normal vector, but one in which opposite vectors are identified, i.e.,
$\mathbf{n} = -\mathbf{n}$ (Collings, 1990; de Gennes and Prost, 1993). If we did not know this, however, and tried to observe the
director as an ordinary vector, we would find that which of two opposite observations we got for the director would 
be a matter of pure chance, i.e., an artifact, and that we would retain full predictive power if we identified opposite 
director vectors, i.e., if we coarsened our observational partition. 

Finally, we may have an unpredictable variable, in the following sense. On the one hand, it takes on significantly 
different values in regions of the phase space which are visited under the dynamics. On the other hand, given the 
time scale separating our observations, the dynamics randomizes those values so thoroughly that little or no effective 
prediction of the variable is possible. In these cases, the variable "washes out" from the partition which maximizes 
predictive power, namely Q. In extreme cases, none of the variables has any predictive power, at the time-scale and 
resolution available to us, so the observed process becomes a sequence of IID random variables, and Q becomes the 
trivial partition on T. For example, consider a liter of ideal gas at standard temperature and fixed, normal molecule 
number. If we observe pressure and internal energy (to reasonable precision) at intervals of one year, the dynamics 
will have so thoroughly mixed phase space that our original observations will have absolutely no predictive value at 
all (cf. Shalizi, 2001, ch. 12). In such cases, there is simply no point in making predictions, and one's resources are 
better used elsewhere. 4 
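
The way predictive power washes out as the observation interval grows can be illustrated numerically. The following is a minimal Python sketch under stated assumptions: the two-state transition matrix `P` and the function names are hypothetical choices made for illustration, not a model of the gas discussed above. It computes the mutual information between observations separated by n steps of a stationary, mixing Markov chain; this quantity decays towards zero, which is the sense in which widely separated observations become effectively IID.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of a row-stochastic matrix P
    (left eigenvector of P for the eigenvalue closest to 1)."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def lagged_mutual_information(P, n):
    """I[S_0; S_n] in bits for the stationary chain with transition matrix P."""
    pi = stationary_distribution(P)
    Pn = np.linalg.matrix_power(P, n)
    joint = pi[:, None] * Pn            # Pr(S_0 = i, S_n = j)
    indep = pi[:, None] * pi[None, :]   # product of the (stationary) marginals
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / indep[mask])))

# Hypothetical two-state chain, chosen only for illustration.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

for n in (1, 5, 20, 100):
    print(n, lagged_mutual_information(P, n))
```

With these assumed numbers the printed values shrink rapidly with n; how fast they shrink in a real system depends on its mixing time relative to the observation interval.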



4 In the real world, it is often not obvious when a variable contains predictive information, at least at the time- and resolution-scale of 
interest, and great efforts can be devoted to ever-more-elaborate deterministic models of what are, to all intents and purposes, coin-flips. 
For an example of causal state reconstruction showing that some variables contained no useful information, and how recognition of this 
led to improved predictions, see Palmer, Fairall and Brewer (2000). Of course, variables which are effectively IID over long times or at 
coarse resolution can contain a lot of predictive information at finer scales. We will return to this point later; cf. the hierarchical scaling 
complexities of Badii and Politi (1997). 
D. The Physical Meaning of the Causal States 

Starting from the observational variables, one can construct the causal states, and from them the minimal 
coarse-grained observation which allows for optimal prediction of the original observables. If the two do not coincide, 
one can profitably replace the original set of observables with a new one, either by adjoining new observations or by 
eliminating unphysical distinctions or variables without predictive power. In any case, we can construct, from the 
original macrovariables, a new set of macrovariables whose macrostates are their own causal states. 

These well-constructed macrostates have a number of properties it is worth noting. First, their statistical complexity 
is just the amount of information the macrovariables contain about the microstate — how much our uncertainty about 
the microstate is reduced by learning the macrostate. Second, the macrostates are Markovian. This means that they 
will be mixing just when they satisfy the conditions for Markov processes to be mixing. This can be true even when 
T is not mixing. Third, again because the macrostates are Markovian, there is a Gibbs distribution over sequences of 
macrostates (Brémaud, 1999). We have not had to assume any sort of equilibrium property, however, and this may 
be part of the reason why Gibbs distributions are still useful out of equilibrium. 5 

We began with certain arbitrary or subjective decisions, about which variables to observe — about what partition 
T of r to employ. Our desire to have dynamics with good causal properties (Markovianity, etc.) led us first to 
refine that partition by considering observational histories, and then to group together histories in constructing the 
causal states. Whether, at that point, we end up adjoining new variables to our original macrovariables, leaving them 
alone, or even coarsening them, has nothing to do with our experimental decisions or epistemic hankerings, merely the 
purely mechanical, physical, objective microdynamics. In the causal states we have arrived, so to speak, at objective 
explanations of subjective quantities (cf. Crutchfield and Shalizi, 1999). 

IV. LEVELS OF DESCRIPTION AND EMERGENCE 

Earlier, we raised the puzzle of how different levels of description of the same system can co-exist. The answer, we 
propose, is that different causal states are induced by different measurement partitions. Consider two measurement 
partitions, one a coarsening of the other. The coarser measurement is therefore a function of the finer one, and is on a 
higher, less specific level of description. Suppose the finer, lower-level measurement partition is causal. It is well-known 
that the Markov property does not generally survive coarsening the states, which means that its coarse-grainings will 
not, in general, be their own causal states. The causal states of the coarse-grained measurements are well-defined, 
however, and cannot be any finer than the states of the fine-grained measurement partition. It is possible, however, 
that one does not need to go all the way back to the original partition to find those causal states — in fact, there is 
no reason the coarse-grained measurements cannot be identical to their own causal states. We then have two levels 
of description, and can give a coherent causal account at each level. 
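
Whether a given coarse-graining of a Markov chain is itself Markovian can be checked directly in small examples. The sketch below is a minimal illustration, not part of the original argument: it uses the standard strong-lumpability criterion of Kemeny and Snell (the total probability of jumping into each block must be the same for every state within a block), and the three-state transition matrix `P` and the function name are hypothetical. Strong lumpability guarantees a Markovian coarse-grained chain for every initial distribution; a coarse-graining that fails this test may still be Markovian for particular initial distributions.

```python
import numpy as np

def is_strongly_lumpable(P, blocks, tol=1e-12):
    """Check strong lumpability of a row-stochastic matrix P with respect to
    a partition of the states, given as a list of lists of state indices."""
    for target in blocks:
        # Probability of jumping from each state into the target block.
        into_target = P[:, target].sum(axis=1)
        for source in blocks:
            vals = into_target[source]
            # All states in one source block must agree on this probability.
            if np.max(vals) - np.min(vals) > tol:
                return False
    return True

# Hypothetical three-state chain, chosen only for illustration.
P = np.array([
    [0.5, 0.25, 0.25],
    [0.2, 0.40, 0.40],
    [0.2, 0.30, 0.50],
])

print(is_strongly_lumpable(P, [[0], [1, 2]]))   # merging states 1 and 2
print(is_strongly_lumpable(P, [[0, 1], [2]]))   # merging states 0 and 1
```

For this assumed matrix, merging states 1 and 2 passes the test, while merging states 0 and 1 fails it, so the same fine-grained chain supports one coarse level of description but not the other.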

In this section, we explore the uses of these ideas in interpreting statistical mechanics, and suggest a definition 
of emergence. We start by considering the old question of the relationship between molecular dynamics and thermodynamics, 
in the particularly transparent context of the fluctuations of a gas at equilibrium. This leads us to 
suggest a definition of "emergence". We then clarify the relationship between generalized hydrodynamics and thermodynamics, 
and attempt to explain the ubiquity of Gibbs distributions for macroscopic configurations. Finally, we 
look at the practice of cellular automata and lattice gas modeling for examples of deliberately constructing adequate 
coarse-grainings. 

A. Equilibrium Fluctuations and a Definition of Emergence 

Systems prepared "in equilibrium" actually fluctuate continually. If our observations are sufficiently coarse, then we 
will essentially only see fluctuations about equilibrium which leave us in the linear regime. In that case, the Onsager 
theory provides the tools to describe the fluctuations, and to do so in terms of the same variables which work at 
equilibrium (Keizer, 1987). 

Consider a cubic centimeter of argon at fixed temperature, pressure and number. (For a more detailed version of 
what follows, see Shalizi, 2001, sec. 11.2.3.) The only macrovariable left to fluctuate is the internal energy. One can 
calculate, from the Onsager theory, that the Shannon entropy of the internal energy is 33.3 bits. Taking a time-step 
of one millisecond, the entropy rate, i.e., the rate at which the uncertainty increases, is 4.4 bits. The predictive 
information is thus 33.3 - 4.4 = 28.9 bits. In doing the corresponding calculations for the microstates, we start with 
the fact that the microstate is its own causal state, since (almost by definition) it is Markovian. Thus $C_\mu = 6.6 \times 10^{20}$ 
bits. If we take the time-step to be one nanosecond, one can estimate $h_\mu$ (Gaspard, 1998) to be $3.3 \times 10^{20}$ bits, with 
$E = 3.3 \times 10^{20}$ bits. 

5 We owe this last suggestion to conversation with Erik van Nimwegen, but are pretty sure he disagrees. 

Following Palmer (2001), we define the predictive efficiency of a process as the fraction of the information it contains 
which actually affects the future, i.e., as the ratio $E/C_\mu$. We then see that the macrostates can be predicted with 
much higher efficiency (0.87) than the microstates (0.5). Indeed, this comparison is rather unfair to the macrostates, 
since we are predicting them over a much longer time-scale. If we predicted them at the same time resolution as the 
microstates, we would find that the efficiency of prediction was essentially one. Conversely, if we tried to predict the 
microstates at the macro time-scale, we would find an efficiency of prediction of essentially zero. Yet the macrovariables 
are transparently a function, a coarse-graining, of the microvariables. 
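
The efficiencies quoted above follow from simple arithmetic, reproduced in the short Python sketch below. The function and variable names are ours. We also assume that the statistical complexity of the macrostates equals the 33.3-bit Shannon entropy of the internal energy, a reading which is consistent with the quoted value of 0.87.

```python
def predictive_efficiency(predictive_information_bits, statistical_complexity_bits):
    """Fraction of the stored information that is actually predictive: E / C_mu."""
    return predictive_information_bits / statistical_complexity_bits

# Macrostates (internal energy, one-millisecond time-step), figures from the text.
H_macro = 33.3                 # Shannon entropy of the internal energy, bits
h_macro = 4.4                  # entropy rate, bits per time-step
E_macro = H_macro - h_macro    # predictive information, 28.9 bits
macro = predictive_efficiency(E_macro, H_macro)

# Microstates (one-nanosecond time-step), figures from the text.
C_micro = 6.6e20               # statistical complexity, bits
E_micro = 3.3e20               # predictive information, bits
micro = predictive_efficiency(E_micro, C_micro)

print(round(macro, 2), round(micro, 2))
```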

This leads us to define a relation of "emergence" between two sets of causal variables if (1) one is a coarse-graining 
of the other and (2) the coarse-grained variables can be predicted more efficiently. In this sense, we can be precise 
about the long-standing intuition that thermodynamics emerges from statistical mechanics: thermodynamic variables 
are more informative about their own dynamics. This also gives us a hint as to what constitutes a good set of 
macrovariables: it should not just be causally complete, but also more predictively efficient than the microvariables. 
This is not always the case; sometimes coarse-grainings are less efficiently predictable than the original variable, a 
condition which J. P. Crutchfield (personal communication) has designated "submergence". 



B. Hydrodynamics and Levels of Description 

One of the more important developments in statistical mechanics and condensed matter physics has been the rise 
of "generalized hydrodynamics," where description centers on the local densities of extensive quantities and order 
parameters (Chaikin and Lubensky, 1995; Forster, 1975). Normal hydrodynamics is included as a special case. We 
are not going to expound this theory, interesting though it is. Rather, we wish to draw out three points. 

The first is that many (perhaps all) systems which are adequately described at the hydrodynamic level can also be 
described, accurately but less precisely, at the thermodynamic level. This is perfectly sensible from our point of view. 
If one starts with observations of local densities, it is extremely unlikely that these will be adequately predicted from 
purely global quantities. The causal states one forms remain, therefore, tied to local densities. Conversely, knowledge 
of the local densities is excessive if all you want to predict are their global averages or sums. The two descriptions 
coexist because they are intended to answer different questions, not because one is more objective than the other. 

Second, one can show (Shalizi, 2001) that the relationship between the hydrodynamic description and the thermodynamic 
one is generally one of emergence, in the sense described above. This is comforting, since one can generally 
"contract" hydrodynamic descriptions into thermodynamic ones (Keizer, 1987). Similarly, the hydrodynamic level 
itself emerges from the thermodynamic one. 

Third, when one constructs local causal states (following Shalizi, 2001), 
one finds that they generally form a Markov random field. 6 Consequently, there is a Gibbs distribution over their 
configurations. Now, there are many examples of hydrodynamic systems, strongly non-equilibrium in their (standard) 
thermodynamics, where there are nonetheless important objects which follow Gibbs distributions with various kinds 
of effective interaction potentials (Cross and Hohenberg, 1993). Perhaps the most striking case of this is vortex lines 
in turbulent fluids; see Chorin (1994) for a full treatment. For conventional statistical mechanics, this is just so 
much dumb luck, but from our point of view, it indicates that the vortex lines (or other coherent, structuring objects) 
are the local causal states. Or rather: if what we find doesn't look Gibbsian, it means we can do better. 



6 No counter-examples are known. Whether they always form a Markov random field is currently an open question. 

C. Building Coarse Grainings in Cellular Automata and Lattice Gases 

Cellular automata and lattice gases are fully-discretized classical field theories. That is, time is discrete, space is a 
discrete regular lattice, and each point or "cell" can take one of a finite number of states at any one time. The state of 
each cell at time t + 1 is a fixed, possibly stochastic, function of the state of the cell at time t, along with the states of 
the cells in a fixed "neighborhood," thus preserving the nice classical property of local interaction. (All cells update 
in parallel.) Originally introduced to model mechanical self-reproduction (Poundstone, 1984; von Neumann, 1966), 
cellular automata have proved useful as models of many natural phenomena (Chopard and Droz, 1998; Gutowitz, 1991; 
Rothman and Zaleski, 1997), as well as mathematically fascinating objects in their own right (Griffeath and Moore, 
forthcoming). They are important to us here because (1) dynamic models of spin systems are stochastic CAs, and 
(2) they illustrate the strategy we are advocating, at least as a matter of tacit practice. 
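
The definition of a cellular automaton given above can be made concrete with a few lines of Python. This is a minimal sketch under assumptions of our own: a one-dimensional lattice with periodic boundaries, two states per cell, a nearest-neighbor neighborhood, and a deterministic rule specified by its Wolfram rule number (rule 110 is an arbitrary illustrative default). Stochastic rules and lattice gases would need more machinery than is shown here.

```python
def step(cells, rule=110):
    """One synchronous update of an elementary (two-state, radius-1)
    cellular automaton on a ring of cells."""
    n = len(cells)
    new = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        index = 4 * left + 2 * center + right   # neighborhood read as a 3-bit number
        new.append((rule >> index) & 1)         # look up that bit of the rule number
    return new

# Start from a single occupied cell and print a few generations.
cells = [0, 0, 0, 1, 0, 0, 0]
for _ in range(4):
    print("".join(".#"[c] for c in cells))
    cells = step(cells)
```

Every new cell value is computed from the old configuration only, which is what "all cells update in parallel" requires.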

When one simulates a CA, one knows, exactly, both the underlying microstate and its dynamics. It can nonetheless 
be very hard to say what it will do and why it will do it. This is not simply because some CAs have high computational 
complexity (Burks, 1970; Griffeath and Moore, 1996; Moore, 1997). Rather, it is because the raw microstate is too 
detailed to be of use: $C_\mu$ is much too high. One gains understanding by deliberately throwing away most of 
the microscopic information, finding instead coarse-grained observations where the dynamics are simpler to grasp 
(Crutchfield, 1992). Generally, this means constructing macrostates with well-behaved Markovian dynamics. There 
are numerous examples of this strategy in the literature, including the many derivations of hydrodynamics in lattice 
gases (Rothman and Zaleski, 1997), the theory of heat conduction in the Creutz CA (Saito et al., 1999), or the vortex 
dynamics of the zero-temperature Potts model (Moore, Nordahl, Minar and Shalizi, 1999). The goal, always, is to 
throw away as much detail as possible, while retaining information relevant to certain aspects of the large-scale 
dynamics, that is, to find simple but accurate representations. (Simple, inaccurate representations are of course easy 
to find.) Spatial computational mechanics (Crutchfield and Hanson, 1993; Hanson, 1993; Hanson and Crutchfield, 
1997; Hordijk, Shalizi and Crutchfield, 2001) provides tools whereby one can automatically find and filter out low-information 
patterns, concentrating one's attention on higher-level, information-rich emergent structures. 



V. WHAT IS NOT BEING SAID 

There are a number of puzzles about macrovariables which our arguments do not resolve. 

1. Why are so many good macrovariables extensive quantities? It certainly does not seem to follow from the fact 
that complete sets of macrovariables are causal states. We suspect, however, that something could be made of the 
following sketch of an argument. Extensive variables, by definition, add across sub-systems. If those sub-systems are 
independent, or nearly so, then their totals will have large deviation properties (Ellis, 1985, 1999). In other words, 
they will become increasingly well-behaved, statistically, in large systems. This makes them good candidates for 
experimental observation. Conceivably, there are many extensive variables, other than the ones we commonly observe, 
which, while subject to large-deviation principles for their additive fluctuations, are ill-behaved (non-Markovian) over 
time, and so we ignore them. Conversely, when approximate independence across sub-systems is violated, the good 
macrovariables are non-extensive, and we need Tsallis statistics (rather than the usual central-limit-theorem statistics). 

2. Why are almost all good macrovariables derivatives of thermodynamic potentials? A deflating answer would be 
that candidate thermodynamic potentials are under intense selection pressure for just this property. 

3. Why are some good macrovariables reusable, e.g., why is temperature a good macrovariable for almost everything? 
It is not simply that we can only observe a few variables and so observe them for almost everything, because order 
parameters, for instance, are generally not reusable (no sense measuring the nematic director in an antiferromagnet). 
Moreover, why do these variables typically take similar roles in systems with radically different microphysics? (E.g., 
temperature again.) To the best of our knowledge, no one has an answer to these questions; we certainly don't. 

Finally, our approach does not say why it is legitimate to treat a large, locally unstable mechanical system stochastically 
in the first place. We have simply assumed that, since we are considering coarse-grained observations, it 
is legitimate to deal with them statistically. While the maximum entropy principle provides a superficially attractive 
justification for this, it is open to grave philosophical objections (Guttmann, 1999; Sklar, 1993). Worse, Amari 
(2001) has shown that maximum entropy distributions are simply those that minimize the degree of statistical dependence 
between variables. 7 If distributions evolve towards minimal dependency, that is surely just a contingent 
fact about the dynamics, rather than a universal principle of inference. We believe that the answer lies rather in 
ergodic theory (Khinchin, 1949; Mackey, 1992), particularly recent developments which emphasize the rapid mixing 
of low-dimensional projections of high-dimensional smooth dynamics (Dorfman, 1998; Gaspard, 1998; Ruelle, 1999), 
plus the philosophical assumption that the initial conditions of the world are "generic". 



7 Intuitively, this makes sense. The entropy, in bits, is the minimum mean number of binary variables needed to specify a sample 
drawn from the distribution. Dependencies can be used to shorten the description, hence maximizing the entropy requires minimizing 
dependencies. Showing this in detail requires an excursion through information geometry, and we refer the interested reader to Amari's 
paper. 
VI. CONCLUSION 

Let us recap by way of telling a story. Nameless men in black approach us, as reputable practitioners of statistical 
mechanics, with a physical system, in this case, a beaker full of gooey, shiny black stuff that sometimes moves 
spontaneously. We are able to probe certain aspects of it by physically coupling to it — e.g., we can X-ray it, take 
photographs, attach voltmeters, scatter neutrons through it. The men in black want us to predict certain properties 
of the black oil, say, what will cause it to quiver in different ways. A mixture of interest and feasibility thus dictates an 
initial choice of macrovariable. Given this, we attempt to refine our predictive capabilities by considering histories of 
observations. From them we construct causal states. If the causal states do not coincide with our initial observables, 
we either supplement them with new variables, in a way which can be determined from the causal states construction; 
or we eliminate unphysical distinctions and unpredictive variables, again on the basis of the causal states. When we 
need supplementary variables, we can either devise new experimental methods to observe them, probing new aspects 
of the physics, or we can merely construct them logically, from histories of our original observables. At the end of 
this process we have the minimal set of variables from which we can optimally predict the macrovariables of interest; 
ones which are, moreover, causally complete. 

Our initial choice of macrovariables is the product of our ability to observe the system, and our choices about what 
to predict. Beyond that initial choice, and the requirement that good macrostates have certain causal properties, 
the causal states we use are completely out of our control, fixed entirely by the objective, microphysical dynamics. 
A different set of initial variables will, generally, lead to a different set of causal states. Sometimes, but not always, 
these causal states are related in a hierarchy of emergence. One might put it like this: for every question we ask It, 
Nature has a definite answer; but Nature has no preferred questions. 
APPENDIX A: Information Theory 

The information contained in a discrete random variable X, also called its entropy or Shannon entropy, is 

$$
\begin{aligned}
H[X] &\equiv -\sum_x \Pr(X = x) \log_2 \Pr(X = x) \\
&= -\langle \log_2 \Pr(X) \rangle .
\end{aligned}
$$

It is the smallest number of bits (binary distinctions) needed, on average, to specify the value of X. We may think of 
it as the uncertainty an ideal observer, who knew the true ensemble X is drawn from, would have about X. 
The joint entropy of two variables X and Y is defined similarly, 

$$
H[X, Y] = -\sum_{x,y} \Pr(X = x, Y = y) \log_2 \Pr(X = x, Y = y) .
$$

It is easy to show that $H[X, Y] \le H[X] + H[Y]$, with equality if and only if X and Y are statistically independent. 
The conditional entropy of X given Y is 

$$
\begin{aligned}
H[X \mid Y] &\equiv H[X, Y] - H[Y] \\
&= -\sum_y \Pr(Y = y) \sum_x \Pr(X = x \mid Y = y) \log_2 \Pr(X = x \mid Y = y) \\
&= \sum_y \Pr(Y = y) H[X \mid Y = y] .
\end{aligned}
$$

Then $H[X \mid Y = y]$ is the information needed to specify X in the sub-ensemble where Y has the value y, and 
$H[X \mid Y]$ is the average information remaining in X given Y. 
Finally, the mutual information between X and Y is 

$$
\begin{aligned}
I[X; Y] &= H[X] + H[Y] - H[X, Y] \\
&= H[X] - H[X \mid Y] \\
&= H[Y] - H[Y \mid X] \le H[Y] ,
\end{aligned}
$$

that is, the amount by which knowledge of one variable reduces the uncertainty in the other. 
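
The definitions in this appendix translate directly into code. The following minimal Python sketch computes entropy, conditional entropy and mutual information from a joint probability table; the function names and the example table are illustrative assumptions of ours, with rows indexing values of X and columns indexing values of Y.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability table (zero cells are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def conditional_entropy(joint):
    """H[X|Y] = H[X,Y] - H[Y], with rows indexing X and columns indexing Y."""
    joint = np.asarray(joint, dtype=float)
    return entropy(joint) - entropy(joint.sum(axis=0))

def mutual_information(joint):
    """I[X;Y] = H[X] + H[Y] - H[X,Y]."""
    joint = np.asarray(joint, dtype=float)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)

# Hypothetical joint distribution Pr(X = x, Y = y), for illustration only.
joint = np.array([[0.50, 0.00],
                  [0.25, 0.25]])

print("H[X,Y] =", entropy(joint))
print("H[X|Y] =", conditional_entropy(joint))
print("I[X;Y] =", mutual_information(joint))
```

The identity I[X;Y] = H[X] - H[X|Y] can be confirmed on any such table by comparing `mutual_information(joint)` with `entropy(joint.sum(axis=1)) - conditional_entropy(joint)`.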

APPENDIX B: Markov processes 

Suppose a process generates a probability distribution over time series, $\ldots, s_{t-1}, s_t, s_{t+1}, \ldots$. It is a (first-order) 
Markov process if 

$$
\Pr(s_{t+1} = s \mid \ldots, s_{t-1}, s_t) = \Pr(s_{t+1} = s \mid s_t) .
$$

In other words, the only dependence that $s_{t+1}$ has on its entire past history is on its current state $s_t$, and previous 
values yield no additional information about its future. Examples of Markov processes include: 

• A deterministic dynamical system where $s_{t+1} = f(s_t)$ for some function $f$ 

• A series of fair coin flips, where $\Pr(s_{t+1} = s) = 1/2$ for $s \in \{\text{heads}, \text{tails}\}$ 

• Brownian motion, where $\Pr(s_{t+1} = x \mid s_t = y) = f(|x - y|)$ for a Gaussian function $f$ 

As an example of a non-Markovian process, suppose $s_t$ is the position at time t of a particle which is moving at 
constant velocity. Here $s_{t+1}$ depends on $s_t$ and $s_{t-1}$, i.e. its dynamics is second-order, so there are correlations with 
the past that are not captured by the current state. On the other hand, if we expand our set of observables so that 
$s_t$ includes both the particle's position and its velocity, then the process becomes first-order Markovian. 
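
The constant-velocity example can be checked in a few lines of Python. This is a minimal sketch with hypothetical function names and integer positions chosen for convenience. Two particles pass through the same position with different velocities, so the current position alone cannot determine the next one, while either the previous position or the velocity restores predictability.

```python
def trajectory(x0, v, steps):
    """Positions of a particle starting at x0 and moving at constant velocity v."""
    return [x0 + v * t for t in range(steps)]

def next_from_position_history(previous, current):
    """Second-order rule on positions alone: s_{t+1} = 2 s_t - s_{t-1}."""
    return 2 * current - previous

def next_from_state(state):
    """First-order rule on the augmented state (position, velocity)."""
    x, v = state
    return (x + v, v)

a = trajectory(0, 1, 4)    # moving right
b = trajectory(4, -1, 4)   # moving left

# Both particles are at the same position at t = 2, yet their futures differ.
print(a[2], b[2], a[3], b[3])

# Knowing the previous position, or the velocity, fixes the next position.
print(next_from_position_history(a[1], a[2]), next_from_state((a[2], 1)))
```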

APPENDIX C: Partitions 

A partition P of a set $\Omega$ is a set of subsets $P_i$ of $\Omega$ which are mutually exclusive and jointly exhaustive. That is, 
$P_i \cap P_j = \emptyset$ (unless $i = j$), and $\Omega = \bigcup_i P_i$. The sets in P are the cells of the partition. 

An equivalence relation or equivalence $\sim$ on $\Omega$ is a relation which is reflexive, symmetric and transitive: $a \sim a$, 
$(a \sim b) \Leftrightarrow (b \sim a)$ and $(a \sim b) \wedge (b \sim c) \Rightarrow (a \sim c)$. The equivalence class of a point is the set of all points which are 
equivalent to it. We write the equivalence class of x as $[x]$; $[x] = \{y \mid x \sim y\}$. Since every point is equivalent to itself, 
every point has a non-empty equivalence class, and every point belongs to some equivalence class. 

Proposition 1 Every partition corresponds to an equivalence relation, and vice versa. Equivalence classes are cells 
of the partition. 

Proof. First, we construct an equivalence relation from a partition. Simply say that $x \sim y$ iff x and y are in 
the same cell. This is symmetric, reflexive and transitive, hence an equivalence. Now we build a partition from an 
equivalence relation. We claim that the equivalence classes are mutually exclusive and jointly exhaustive. Mutually 
exclusive means that either $[x] = [y]$ or $[x] \cap [y] = \emptyset$, for all x and y. To see this, consider any point $z \sim x$. Now, 
$y \sim z$ if and only if $y \sim x$ ("if" by transitivity, and "only if" likewise). Hence $x \sim y$ iff $[x] = [y]$. If $x \not\sim y$, then there 
cannot exist even one z such that $z \sim x$ and $z \sim y$, and so no point belongs to the intersection of $[x]$ and $[y]$. Since every point 
has an equivalence class, the set of equivalence classes is exhaustive. QED. 

Every function f on $\Omega$ induces an equivalence relation $\sim_f$, thus: $a \sim_f b$ iff $f(a) = f(b)$. Similarly every equivalence 
relation defines an (infinite) class of functions: give each equivalence class a unique label and map points to their 
equivalence-class labels. Hence every function defines a partition and vice versa. 

The identity partition is one where each cell contains only a single element of $\Omega$, i.e., where each equivalence class 
consists of a single point. The trivial partition is the one which contains only a single cell, equal to $\Omega$. 

One partition, A, is finer than another, B, iff for each $a \in A$, there exists a $b \in B$ such that $a \subseteq b$, and, for at 
least one a, $a \subset b$. Then A is a refinement of B, and B is coarser than A. Refinement always increases cardinality. If 
neither A nor B is a refinement of the other, and they are not equal, they are incomparable. 

Let A and B be two partitions. Construct all the non-empty sets formed by taking the intersection of one cell from 
A with one cell from B. This collection is also a partition, the product of A and B. Symbolically, $A \cdot B = 
\{c \neq \emptyset \mid \exists a \in A, b \in B, c = a \cap b\}$. It is a refinement of both A and B. 
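
The product and refinement relations are easy to experiment with on small finite sets. The sketch below is a minimal illustration in Python, representing a partition as a set of frozensets; the function names and the example partitions are our own assumptions.

```python
def product(A, B):
    """Product partition: all non-empty intersections of a cell of A with a cell of B."""
    return {a & b for a in A for b in B if a & b}

def is_refinement(A, B):
    """True if A is finer than B: every cell of A lies inside some cell of B,
    and the two partitions are not equal."""
    return A != B and all(any(a <= b for b in B) for a in A)

# Two hypothetical partitions of {1, 2, 3, 4}.
A = {frozenset({1, 2}), frozenset({3, 4})}
B = {frozenset({1, 3}), frozenset({2, 4})}

AB = product(A, B)
print(sorted(sorted(cell) for cell in AB))
print(is_refinement(AB, A), is_refinement(AB, B))   # the product refines both
print(is_refinement(A, B), is_refinement(B, A))     # A and B are incomparable
```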
Proposition 2 (Factoring Refinements) Let A be a partition and B be any refinement of A. Then there exists a 
minimal factor partition C such that $B = A \cdot C$ and $B \neq A \cdot D$ for any D with fewer cells than C. If A and B are 
both finite, then the minimal factor partitions are themselves finite, and there is a finite number of them. 

Proof. We construct a minimal factor C. Each cell a of A contains $n_a$ cells from B; $a = \bigcup_{j=1}^{n_a} b^a_j$. Let N be the 
maximum of $n_a$ over all the cells of A. Define $c_j = \bigcup_{a \in A} b^a_j$ (taking $b^a_j = \emptyset$ when $j > n_a$). Clearly $C = \{c_j\}$ is a partition, and equally clearly its 
product with A will be B. Any partition whose product with A is B must have a cardinality of at least N, because 
it must break (at least) one cell of A into N sub-cells. Hence we have constructed a minimal C whose product with 
A gives the desired refinement. Moreover any D such that $B = A \cdot D$ must be a refinement of some minimal factor 
C. The number of minimal factors is at most the number of ways of labeling the subcells of A, viz., $\prod_{a \in A} n_a!$. 

Note that minimal factors are incomparable to A. They are not refinements, because each cell of the factor is not 
entirely contained within a single cell of A. But conversely, there are cells in A which are not entirely contained within 
a single cell of the factor. Obviously, they are not equal to A. Hence they are incomparable. 
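
The construction used in the proof of Proposition 2 can be sketched in Python for finite partitions. This is an illustration under assumptions of our own: partitions are sets of frozensets, the labeling of subcells within each cell of A is an arbitrary sorted order, and the function names are hypothetical. Other labelings would give other minimal factors.

```python
def product(A, B):
    """Product partition: all non-empty intersections of a cell of A with a cell of B."""
    return {a & b for a in A for b in B if a & b}

def minimal_factor(A, B):
    """Given a partition A and a refinement B of A, build one minimal factor C
    with product(A, C) == B, by taking the union of the j-th subcells across cells of A."""
    # For each cell a of A, list the cells of B inside it in a fixed (arbitrary) order.
    subcells = {a: sorted((b for b in B if b <= a), key=sorted) for a in A}
    N = max(len(cells) for cells in subcells.values())
    C = set()
    for j in range(N):
        # c_j is the union of the j-th subcell of every cell of A that has one.
        parts = [cells[j] for cells in subcells.values() if len(cells) > j]
        C.add(frozenset().union(*parts))
    return C

# Hypothetical example: A has cells of sizes 3 and 2, and B is the identity partition.
A = {frozenset({1, 2, 3}), frozenset({4, 5})}
B = {frozenset({x}) for x in range(1, 6)}

C = minimal_factor(A, B)
print(sorted(sorted(cell) for cell in C))
print(product(A, C) == B, len(C))
```

Here the factor has as many cells as the most finely divided cell of A, in line with the lower bound N in the proof.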